R Configuration

Below is a description of our R environment.

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.10    stringi_1.1.2  
##  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.11  
## [13] evaluate_0.10

Our Data

The data sets used in this project were government issued data (www.data.gov) and census data retrieved from data.world.

The first data set includes information about every registered higher education institution in the United States (and US territories) from 2003 to 2013. Some of the information relevant to our project includes location (state), average SAT data, demographic information, number of undergraduates, and tuition information. The second data set was taken from the 2010 census data. It includes the population of each state. The third data set, also taken from 2010 census data, includes information on the number of residents over 25, high school graduates, and Bachelors degree holders in each state.

The final dataset was obtained from the Center for Disease Control and is entitled adult_obese in the data.world repository. The data contains obesity rates as percent of the adult and adolescent populations for all 50 states. It further includes confidence intervals and sample sizes for each state.

Accessing Our “Dirty” Data From data.world

The “dirty” version of our data can be accessed from data.world using the following R script.

source("../01_Data/data.world.R")
## Loading required package: jsonlite
## Loading required package: RCurl
## Loading required package: bitops
summary(df)
##       Year          UNITID        
##  Min.   :2003   Min.   :  100654  
##  1st Qu.:2005   1st Qu.:  164872  
##  Median :2008   Median :  214962  
##  Mean   :2008   Mean   :  966061  
##  3rd Qu.:2011   3rd Qu.:  418029  
##  Max.   :2013   Max.   :48285707  
##                                   
##                              F1SYSNAM    
##  NULL                            :38470  
##  #N/A                            : 6745  
##  ITT Educational Services Inc.   : 1161  
##  Corinthian Colleges Inc.        :  969  
##  EEG Inc.                        :  909  
##  Education Management Corporation:  899  
##  (Other)                         :29464  
##                          INSTNM               CITY           STABBR     
##  ITT TECHNICAL INSTITUTE    :  151   New York   :  800   CA     : 7744  
##  Marinello School of Beauty :  105   Chicago    :  726   NY     : 4981  
##  Lincoln Technical Institute:   84   Houston    :  656   TX     : 4564  
##  ITT Technical Institute    :   79   Los Angeles:  555   PA     : 4475  
##  Sanford-Brown Institute    :   66   Brooklyn   :  454   FL     : 4022  
##  Southwestern College       :   59   Miami      :  441   OH     : 3752  
##  (Other)                    :78073   (Other)    :74985   (Other):49079  
##       ZIP             main          NUMBRANCH        ADM_RATE_ALL  
##  90010  :   73   Min.   :0.0000   Min.   :  1.000   NULL   :47168  
##  7306   :   67   1st Qu.:1.0000   1st Qu.:  1.000   1      : 2843  
##  2116   :   66   Median :1.0000   Median :  1.000   0.816  :  149  
##  85021  :   62   Mean   :0.7858   Mean   :  4.821   0.7359 :  132  
##  11201  :   61   3rd Qu.:1.0000   3rd Qu.:  2.000   0.3449 :  122  
##  23462  :   61   Max.   :1.0000   Max.   :141.000   0.5    :   82  
##  (Other):78227                                      (Other):28121  
##     ACTCMMID      SAT_AVG_ALL         UGDS         UGDS_WHITE   
##  NULL   :64564   NULL   :61498   NULL   : 5806   NULL   :37453  
##  22     : 2149   1010   :  371   35     :  218   0      : 9611  
##  23     : 1923   1030   :  350   56     :  207   1      :  560  
##  21     : 1917   970    :  332   44     :  202   0.5    :   85  
##  24     : 1483   1050   :  327   40     :  196   0.8    :   69  
##  20     : 1281   990    :  307   55     :  192   0.6667 :   68  
##  (Other): 5300   (Other):15432   (Other):71796   (Other):30771  
##    UGDS_BLACK      UGDS_HISP       UGDS_ASIAN      UGDS_OTHER   
##  NULL   :37453   NULL   :37453   NULL   :37453   NULL   :37453  
##  0      :11021   0      :11505   0      :16029   0      : 5491  
##  1      :  138   1      :  725   0.0078 :  116   1      :  873  
##  0.0714 :   70   0.0625 :   66   0.0044 :  114   0.2    :   49  
##  0.0588 :   58   0.0278 :   63   0.0061 :  109   0.5    :   43  
##  0.1429 :   58   0.0455 :   63   0.0071 :  108   0.1    :   42  
##  (Other):29819   (Other):28742   (Other):24688   (Other):34666  
##  TUITIONFEE_IN   TUITIONFEE_OUT 
##  NULL   :31829   NULL   :33010  
##  18048  :  530   18048  :  530  
##  2700   :  116   17148  :  107  
##  17148  :  106   15600  :  104  
##  15600  :  104   9000   :  104  
##  15495  :  100   14880  :  101  
##  (Other):45832   (Other):44661

Cleaning Up Our Data

To make our data easier to understand and work with, we created the following ETL (extract-translate-load) script to edit our CSV file containing higher education institution data using the steps below.

  1. After cloning our repo, we set our working directory to the “00_Doc” folder.

  2. We then downloaded and saved our dataset from data.world to the “01_Data” folder. We renamed the file to “PreETL_merged_university_data”, and created a file path leading to this file so that it could be read into RStudio as a data frame. (See the next section for instructions on how to download data files from data.world)

  3. We looked through the data set and determined that the column names needed to be more descriptive. So we looked up what each of the variable names represented and renamed the variables so that they would be more descriptive. We did this by using the rename function from the plyr package.

  4. For consistency, we decided to alter the zip codes so that all zip codes would only contain 5 digits. We did this by using the substring function.

  5. Lastly, we changed all of the “NULL” values to read “NA in order to be more user friendly. This was accomplished by using the lapply and gsub functions.

  6. We saved a copy of our “clean” CSV file by removing “PreETL_” from its name, and uploaded our clean dataset back to data.world.

The code used to accomplish these steps can be found in detail in the following R script. Below is a sample of our “clean” CSV file.

source("../01_Data/ETL.Final.R")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Year = col_integer(),
##   UNITID = col_integer(),
##   main = col_integer(),
##   NUMBRANCH = col_integer()
## )
## See spec(...) for full column specifications.
summary(df)
##       Year          Unit_ID                                System_Name   
##  2013   : 7804   100654 :   11   #N/A                            : 6745  
##  2012   : 7793   100663 :   11   ITT Educational Services Inc.   : 1161  
##  2011   : 7675   100690 :   11   Corinthian Colleges Inc.        :  969  
##  2010   : 7414   100706 :   11   EEG Inc.                        :  909  
##  2009   : 7149   100724 :   11   Education Management Corporation:  899  
##  2008   : 6975   100751 :   11   (Other)                         :29464  
##  (Other):33807   (Other):78551   NA's                            :38470  
##                     Institution_Name          City           State      
##  ITT TECHNICAL INSTITUTE    :  151   New York   :  800   CA     : 7744  
##  Marinello School of Beauty :  105   Chicago    :  726   NY     : 4981  
##  Lincoln Technical Institute:   84   Houston    :  656   TX     : 4564  
##  ITT Technical Institute    :   79   Los Angeles:  555   PA     : 4475  
##  Sanford-Brown Institute    :   66   Brooklyn   :  454   FL     : 4022  
##  Southwestern College       :   59   Miami      :  441   OH     : 3752  
##  (Other)                    :78073   (Other)    :74985   (Other):49079  
##    Zip_Code         Main      Number_of_Branches Admissions_Rate
##  Length:78617       0:16842   1      :55705      1      : 2843  
##  Class :character   1:61775   2      : 6871      0.816  :  149  
##  Mode  :character             3      : 3806      0.7359 :  132  
##                               4      : 2374      0.3449 :  122  
##                               5      : 1510      0.5    :   82  
##                               6      : 1130      (Other):28121  
##                               (Other): 7221      NA's   :47168  
##  Avg_ACT_Score   Avg_SAT_Score   Total_Undergraduates Percent_White  
##  22     : 2149   1010   :  371   35     :  218        0      : 9611  
##  23     : 1923   1030   :  350   56     :  207        1      :  560  
##  21     : 1917   970    :  332   44     :  202        0.5    :   85  
##  24     : 1483   1050   :  327   40     :  196        0.8    :   69  
##  20     : 1281   990    :  307   55     :  192        0.6667 :   68  
##  (Other): 5300   (Other):15432   (Other):71796        (Other):30771  
##  NA's   :64564   NA's   :61498   NA's   : 5806        NA's   :37453  
##  Percent_Black   Percent_Hispanic Percent_Asian   Percent_Other  
##  0      :11021   0      :11505    0      :16029   0      : 5491  
##  1      :  138   1      :  725    0.0078 :  116   1      :  873  
##  0.0714 :   70   0.0625 :   66    0.0044 :  114   0.2    :   49  
##  0.0588 :   58   0.0278 :   63    0.0061 :  109   0.5    :   43  
##  0.1429 :   58   0.0455 :   63    0.0071 :  108   0.1    :   42  
##  (Other):29819   (Other):28742    (Other):24688   (Other):34666  
##  NA's   :37453   NA's   :37453    NA's   :37453   NA's   :37453  
##  In-State_Tuition Out-State_Tuition
##  18048  :  530    18048  :  530    
##  2700   :  116    17148  :  107    
##  17148  :  106    15600  :  104    
##  15600  :  104    9000   :  104    
##  15495  :  100    14880  :  101    
##  (Other):45832    (Other):44661    
##  NA's   :31829    NA's   :33010

Downloading Our “Clean” Data from data.world

Once we cleaned up our data, we uploaded a new copy of it to data.world. To import this csv file for use in Tableau:

  1. On the Tableau homescreen, under “To a Server” click “Web Data Connector” on the left-hand Connect menu.

  2. For the URL, type “https://tableau.data.world” and press Enter.

  3. For the Dataset, type “mamilloy/s-17-dv-project-6”.

  4. Copy your personal data.world API Token into the next field and click “Get Dataset”.

  5. All csv files in the above dataset will appear on the left-hand side of the screen to be used.

Our Shiny Application

Our final Shiny application can be accessed at the following address: https://mamilloy.shinyapps.io/final_project/

Visualizations Using Crosstabs

In Tableau, we created a crosstabs visualization to analyze possible relationships between year and state. Specifically in this visual, we looked at the ratio of in-state tuition to total graduate students across the states. First, two data sets from our data.world profile needed to be joined. The university data had the number of graduate students and the 2010 Data_Update2 sheet had the in-state tuition information for each state. We completed an inner join on state of these two data sets in order to get the in-state tuition and total undergraduate students for each state. Across the x-axis of the cross tabs in year and along the y-axis is state. In each of the cells is the total number of undergraduates for that year and state. The KPI is the in-state tuition over the total number of undergraduates. The table is colored by KPI: the cells in orange are the low KPI, red is medium KPI, and blue is high KPI. This visual is interesting because it can be seen how the tuition for in-state students can be broken down across the average university population in that state. It really shows how high in-state tuition can be.

In-State Tuition vs. Total Undergraduates

In-State Tuition vs. Total Undergraduates

The shiny app was also created to mimic the Tableau visual. In order to join the data via the server.R, a SQL query was written that joined the two datasets, calculated the KPI, and separated KPI values into low, medium, and high. The KPI was calculated was the same as in the Tableau visual. The range values for low, medium, and high were set in the ui.R code to match the values set in Tableau.

In-State Tuition vs. Total Undergraduates

In-State Tuition vs. Total Undergraduates

Visualizations Using Barcharts

This first barchart displays the percentage of people working toward a bachelors degree per U.S. state. To create this plot, two datasets were connected using an inner join on each of the states. The first dataset, acsInfo, contained the number of bachelors students per state, and the second dataset, censusdata, containted each state’s total population. Both of these datasets were obtained from the U.S. Census Bureau’s 2010 census data. A calculated field was created to determine the percentage of those studying toward a bachelors degree in each state by dividing bachelors students by state population. Plotting states on the X axis and the calculated field on the Y axis created the final barchart. An average line was used to show where the average percentage of bachelors students falls across all states, which was 12.603%.

Bachelors Students per State

Bachelors Students per State

The above Tableau barchart was also created in Shiny by using ggplot. This Shiny app also displays the percentage of people working toward a bachelors degree in each U.S. state. This plot also contains an average line showing the average percentage of bachelors students across all states, with it’s value displayed in the side panel to the left of the barchart.

Bachelors Students per State

Bachelors Students per State

This next graph depicts the average percentage of students of each demographic category across all the institutions in each US state and territory. Demographic categories include White, Black, Hispanic, Asian, and other. The data is displayed as a stacked bar graph and was retrieved from “Clean_merged_university_data.csv”. In Tableau, this plot was made by dragging “State” into Rows and dragging the measure values for each demographic into Columns. To make this graph in Shiny, the plotly package was used to make a stacked bar graph. Using this package also offers the benefits of interactivity when hovering over columns.

Demographics Demographics

Visualizations Using Histograms

The following histogram was created in Tableau to show how many institutions fall within a certain range of in-state tuition on a yearly basis. The X axis represents bins in increments of $5000 for in-state tuition, and the Y axis represents the number of institutions that fall within each bin. Pages were also added to this visualization such that it could be observed how the number of schools within each bin changes from year to year. As tuition prices increased each year, a visible shift could be seen towards the right of the graph, and the 2003 graph’s right skew became less noticeable.

In-State Tuition Distribution

In-State Tuition Distribution

A similar histogram was also created in Shiny using ggplot and plotly. This histogram also shows how many instututions fall within a certain range of in-state tuition prices, but this histogram includes all data from 2003 to 2013. Hovering over each bin in the app will display each bin’s count. This histogram also includes a reference line representing the average in-state tuition price across all institutions.

In-State Tuition Distribution

In-State Tuition Distribution

Visualizations Using Scatterplots

This scatterplot tests the correlation of obesity to education level across all 50 states using the R packages ggplot and Shiny. The plot displays the percentages of obese people per U.S. state against the number of state citizens possessing a college degree over total state population. The plot reveals a very clear negative correlation between education level and obesity rate by state and is emphasized by the addition of a regression line. The data was amassed from 3 data.world tables that were double joined to make the needed query. The Shiny code (server.R and ui.R) allows for easy sharing of the visualization and the plotly R package imparts hovering functionality to make the graphic more interactive.

Obesity

Obesity

This Tableau graphic recapitulates the scatterplot correlating state obesity rate to education level. The data for the plot was collected via a web SQL query utilizing a double INNER JOIN (see associated image below) allowing the construction of the two measures: prcntObesity against the calculated field(100*(bachelors/population)). The plot was then filtered on state and a trend-line was added. As mentioned previously, a negative association is observed between state education level and obesity rate.

Obesity

Obesity

To build the above scatterplot, a web connection to our data.world repository was established and a double INNER JOIN between the adult_obese, censusdata and acsinfo tables was made that joined on the state variable. It is a graphical representation of the following SQL query: SELECT O.prcntObese, C.State_Population, O.Abbrv, A.bachelors, A.State FROM adult_obese as O JOIN censusdata as C on O.Abbrv = C.State_Avg JOIN acsInfo as A on A.State = O.Abbrv;

Obesity Join

Obesity Join

Using Shiny Actions, we also created a series of scatterplots displaying the average undergraduate SAT score for U.S.universities and grouped them together into categories based on what U.S. state or territitory the school is located . Each state runs along the X axis, and scores run along the Y axis. The scatterplot is also color coded by state with a legend to the right of the plot. Each university is represented by a single point, and each state’s average SAT score is represented by a black diamond. By clicking and dragging a selection box across a subset of the points in the scatterplot, a new scatterplot is created that displays only those points selected. This visualization was generated with the ggplot2 R package using its geom_point() and stat_summary() functions.

SAT Scores

SAT Scores

Visualizations Using Boxplots

This is a box plot that displays the average SAT scores for each school broken down by state. It was created using data from the ‘Clean-Merged-University” data set. We got the data on tableau from using the web connector and connecting through data.world. For the shiny app, we ran a query through data.world. This graphic is interesting because you can see how SAT scores vary across states. In tableau this graphic was created by putting state onto the Columns section, and Average SAT Score onto the rows section. Year was dragged onto color to see which year each score came from. In shiny, it was created using the plot and geom_boxplot functions.

Average SAT Scores Average SAT Scores